Changes in a cell's external or internal conditions are usually reflected in the concentrations of
the relevant transcription factors. These proteins in turn modulate the expression levels of the 
genes under their control and sometimes need to perform non-trivial computations that integrate 
several inputs and affect multiple genes. At the same time, the activities of the regulated genes 
would fluctuate even if the inputs were held fixed, as a consequence of the intrinsic noise in the 
system, and such noise must fundamentally limit the reliability of any genetic computation. Here 
we use information theory to formalize the notion of information transmission in simple genetic 
regulatory elements in the presence of physically realistic noise sources. The dependence of this 
"channel capacity" on noise parameters, cooperativity and cost of making signaling molecules is 
explored systematically. We find that, at least in principle, capacities higher than one bit should 
be achievable and that consequently genetic regulation is not limited to the use of binary, or "on-off",
components.

PACS numbers: 87.16.Yc, 87.16.Xa, 89.70.+c



I. INTRODUCTION 

Networks of interacting genes coordinate complex cel- 
lular processes, such as responding to stress, adapting 
the metabolism to a varying diet, maintaining the circa- 
dian cycle or producing an intricate spatial arrangement 
of differentiated cells during development [1, 2, 3, 4].
The success of such regulatory modules is at least par- 
tially characterized by their ability to produce reliable 
responses to repeated stimuli or changes in the environ- 
ment over a wide dynamic range, and to perform the ge- 
netic computations reproducibly, either on a day-by-day 
or generation timescale. In doing so the regulatory ele- 
ments are confronted by noise arising from physical pro- 
cesses that implement such genetic computations, and 
this noise ultimately traces its origins back to the fact 
that the state variables of the system are concentrations 
of chemicals and "computations" are really reactions be- 
tween individual molecules, usually present at low copy 
numbers [5, 6].

It is useful to picture the regulatory module as a de- 
vice that, given some input, computes an output, which 
in our case will be a set of expression levels of regu- 
lated genes. Sometimes the inputs to the module are 
easily identified, such as when they are the actual chem- 
icals that a system detects and responds to, for exam- 
ple chemoattractant molecules, hormones or transcrip- 
tion factors (TFs). There are cases, however, when it is 
beneficial to think about the inputs on a more abstract 
level: in embryonic development we talk of "positional 
information" and think of the regulatory module as try- 
ing to produce a different gene expression footprint at 
each spatial location [7]; alternatively, circadian clocks
generate distinguishable gene expression profiles corre- 
sponding to various phases of the day [II. Regardless of 
whether we view the input as a physical concentration 
of some transcription factor or perhaps a position within
the embryo, and whether the computation is complicated
or as simple as an inversion produced by a repressor, we 
want to quantify its reliability in the presence of noise, 
and ask what the biological system can do to maximize 
this reliability. 

If we make many observations of a genetic regulatory 
element in its natural conditions we are collecting samples
drawn from a distribution $p(X, O)$, where $X$ describes
the state of the input and $O$ the state of the output. Saying
that the system is able to produce a reliable response
$O$ across the spectrum of naturally occurring input conditions
$p(X)$ amounts to saying that the dependency -
either linear or strongly non-linear - between the input
and output is high, i.e. far from random. Shannon has
shown how to associate a unique measure, the mutual
information $I$, with the notion of dependency between
two quantities drawn from a joint distribution [8, 9, 10]:

$$ I(X; O) = \int\!\!\int dX\, dO\; p(X, O) \log_2 \frac{p(X, O)}{p(X)\, p(O)} \tag{1} $$

The resulting quantity is a measure in bits and is es- 
sentially the logarithm of the number of states in the in- 
put that produce distinguishable outputs given the noise. 
A device that has one bit of capacity can be thought of 
as an "on-off" switch, two bits correspond to four dis- 
tinguishable regulatory settings, and so on. Although 
the input is usually a continuous quantity, such as nutri- 
ent concentration or phase of the day, the noise present 
in the regulatory element corrupts the computation and 
does not allow the arbitrary resolution of a real-valued 
input to propagate to the output; instead, the mutual 
information tells us how precisely different inputs are 
distinguishable to the organism. 
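
As a concrete illustration of Eq (1), the following minimal Python sketch evaluates the mutual information for a discretized joint distribution. The function name `mutual_information_bits` and the two toy distributions are assumptions made here for illustration; they are not part of the original analysis.

```python
import math

def mutual_information_bits(joint):
    """Mutual information (in bits) of a discrete joint distribution.

    `joint[i][j]` is the probability of input state i together with
    output state j; entries must be non-negative and sum to 1.
    """
    p_in = [sum(row) for row in joint]
    p_out = [sum(col) for col in zip(*joint)]
    info = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # terms with p = 0 contribute nothing
                info += p * math.log2(p / (p_in[i] * p_out[j]))
    return info

# Toy case 1: four equally likely inputs, each mapped to its own output.
noiseless = [[0.25 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Toy case 2: the output is statistically independent of the input.
independent = [[0.25 * 0.25 for _ in range(4)] for _ in range(4)]

print(mutual_information_bits(noiseless))    # four distinguishable settings
print(mutual_information_bits(independent))  # no dependency at all
```

The first toy case corresponds to the "two bits, four distinguishable regulatory settings" picture in the text, and the second to a completely unreliable element.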

Experimental or theoretical characterization of the 
joint distribution, $p(X, O)$, for a regulatory module can
be very difficult if the inputs and outputs live in a high-dimensional
space. We can proceed, nevertheless, by
remembering that the building blocks of complex mod- 
ules are much simpler, and finally must reduce to the 
point where a single gene is controlled by transcription 
factors that bind to its promoter region and tune the 
level of its expression. While taking a simple element 
out of its network will not be illuminating about how 
the network as a whole behaves in general - especially if 
there are feedback loops - there may be cases where the 
information flow is "bottlenecked" through a single gene, 
and its reliability will therefore limit that of the network. 
In addition, the analysis of a simple regulatory element 
will provide directions for taking on more complicated 
systems; see Ref [11] for a recent related analysis. 

Our aim is therefore to understand the reliability of 
a simple genetic regulatory element, that is, of a single 
activator or repressor transcription factor controlling the 
expression level of its downstream gene. We will identify 
the concentration c of the transcription factor as the only 
input, X = {c}, and the expression level of the down- 
stream gene g as the only relevant output, O = {g}. 
The regulatory element itself will be parametrized by the
input/output kernel, $p(g|c)$, i.e. the distribution (as opposed
to a "deterministic" function $g = g(c)$ in the case of
a noiseless system) of possible outputs given that the
input is fixed to some particular level $c$. For each such
kernel, we will then compute the maximum amount of
information, $I(c; g)$, that can be transmitted through it,
and examine how this capacity depends on the properties 
of the kernel. 



II. MAXIMIZING INFORMATION TRANSMISSION

The input/output kernel of a simple regulatory element,
$p(g|c)$, is determined by the biophysics of
transcription factor-DNA interaction, transcription and 
translation. In contrast, the distribution of inputs, p(c), 
that the cell uses during its "typical" lifetime, is free 
for the cell to change. The cell's transcription factor 
expression footprint is its representation of the envi- 
ronment and internal state, and the form of this rep- 
resentation can be the target of adaptation or evolution- 
ary processes. Together, the input/output kernel and 
the distribution of inputs define the joint distribution, 
$p(c, g) = p(g|c)\, p(c)$, and consequently the mutual information
of Eq (1) between the input and the output,
$I(c; g)$.

Maximizing the information between the inputs and 
outputs, which corresponds to our notions of reliabil- 
ity in representation and computation, will therefore im- 
ply a specific matching between the given input/output 
kernel and the distribution of inputs, p(c), that is be- 
ing optimized. If one believes that a specific regula- 
tory element has been tuned for maximal information 
transmission, then the optimal solution for the inputs, 
$p^*(c)$, and the resulting optimal distribution of outputs,
$p^*(g) = \int dc\, p(g|c)\, p^*(c)$, become experimentally verifiable
predictions. If, on the other hand, the system is not
really maximizing information transmission, then the capacity
achievable with a given kernel and its optimal input
distribution, $I[p(g|c), p^*(c)]$, can still be regarded as
a (hopefully revealing) upper bound on the true information
transmission of the system.

FIG. 1: (Color online) A schematic diagram of a simple
regulatory element. Each input is mapped to a mean output
according to the input/output relation (thick sigmoidal
black line). Because the system is noisy, the output fluctuates
about the mean. This noise is plotted in gray as a
function of the input and shown in addition as error bars on
the mean input/output relation. Inset shows the probability
distribution of outputs at half saturation, $p(g|c = K_d)$ (red
dotted lines); in this simple example we assume that the distribution
is Gaussian and therefore fully characterized by its
mean and variance.

During the past decades the measurements of regula- 
tory elements have focused on recovering the mean re- 
sponse of a gene under the control of a transcription fac- 
tor that had its activity modulated by experimentally 
adjustable levels of inducer or inhibitor molecules [12]. 
Typically, a sigmoidal response is observed with a single
regulator, as in Fig 1, and more complicated regulatory
"surfaces" are possible when there are two or more simultaneous
inputs to the system [13, 14]. In our notation,
these experiments measure the conditional average over
the distribution of outputs, $\bar{g}(c) = \int dg\, g\, p(g|c)$.
Developments in flow cytometry and single-cell microscopy
have enabled experimenters to track, in time and
across a population of cells, the expression levels of fluorescent
reporter genes, and have thus opened a window into
the behavior of fluctuations. Consequently, work exploring
the noise in gene expression, $\sigma_g^2(c) = \int dg\, (g - \bar{g})^2\, p(g|c)$,
has begun to accumulate on both the experimental
and the biophysical modeling side [15, 16, 17, 18].
The efforts to further characterize and understand this
noise were renewed by theoretical work by Swain and
coworkers [19] that has shown how to separate intrinsic
and extrinsic components of the noise, i.e. the noise
due to the stochasticity of the observed regulatory pro- 
cess in a single cell, and the noise contribution that 
arises because typical experiments make many single-cell 
measurements and the internal chemical environments of 
these cells differ across the population. 



A. Small noise approximation 

We start by showing how the optimal distributions 
can be computed analytically if the input/output kernel 
is Gaussian and the noise is small, and proceed by pre- 
senting the exact numerical solution later. Let us assume 
then that the first and second moments of the conditional 
distribution are given, and write the input/output kernel
as a set of Gaussian distributions $\mathcal{G}(g; \bar{g}(c), \sigma_g(c))$, or
explicitly:

$$ p(g|c) = \frac{1}{\sqrt{2\pi \sigma_g^2(c)}} \exp\left[ -\frac{(g - \bar{g}(c))^2}{2 \sigma_g^2(c)} \right], \tag{2} $$

where both the mean response, $\bar{g}(c)$, and the noise, $\sigma_g^2(c)$,
depend on the input, as illustrated in Fig 1.

We rewrite the mutual information between the input
and the output of Eq (1) in the following way:

$$ I(c; g) = \int dc\, p(c) \int dg\, p(g|c) \log_2 p(g|c) - \int dc\, p(c) \int dg\, p(g|c) \log_2 p(g). \tag{3} $$

The first term can be evaluated exactly for Gaussian distributions,
$p(g|c) = \mathcal{G}(g; \bar{g}(c), \sigma_g(c))$. The integral over
$g$ is just the calculation of the (negative of the) entropy
of the Gaussian, and the first term therefore evaluates
to $-\langle S[\mathcal{G}(g; \bar{g}, \sigma_g)] \rangle_{p(c)} = -\frac{1}{2} \langle \log_2 2\pi e\, \sigma_g^2(c) \rangle_{p(c)}$.

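In the second term of Eq (3), the integral over $g$ can be
viewed as calculating $\langle \log_2 p(g) \rangle$ under the distribution
$p(g|c)$. For an arbitrary continuous function $f(g)$ we can
expand the integrals with the Gaussian measure around
the mean:

$$ \langle f(g) \rangle_{\mathcal{G}(g)} = \int dg\, \mathcal{G}(g)\, f(\bar{g}) + \int dg\, \mathcal{G}(g) \left. \frac{\partial f}{\partial g} \right|_{\bar{g}} (g - \bar{g}) + \frac{1}{2} \int dg\, \mathcal{G}(g) \left. \frac{\partial^2 f}{\partial g^2} \right|_{\bar{g}} (g - \bar{g})^2 + \cdots \tag{4} $$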



The first term of the expansion simply evaluates to $f(\bar{g})$.
The series expansion would end at the first term if we
were to take the small noise limit, $\lim_{\sigma_g \to 0} \mathcal{G}(g; \bar{g}, \sigma_g) = \delta(g - \bar{g})$.
The second term of the expansion is zero
because of symmetry, and the third term evaluates to
$\frac{1}{2} \sigma_g^2 f''(\bar{g})$. We apply the expansion of Eq (4) and compute
the second term in the expression for the mutual
information, Eq (3), with $f(g) = \log_2 p(g)$. Taking only
the zeroth order of the expansion, we get

$$ I(c; g) = -\int dc\, p(c) \log_2 \left[ \sqrt{2\pi e}\, \sigma_g(c)\, p(\bar{g}(c)) \right]. \tag{5} $$

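We can rewrite the probability distributions in terms of
$\bar{g}$, using $p(c)\, dc = p(\bar{g})\, d\bar{g}$. To maximize the information
transmission we form the following Lagrangian and
introduce the multiplier $\Lambda$ that keeps the resulting distribution
normalized:

$$ \mathcal{L}[p(\bar{g})] = -\int d\bar{g}\, p(\bar{g}) \log_2 \left[ \sqrt{2\pi e}\, \sigma_g(\bar{g})\, p(\bar{g}) \right] - \Lambda \int d\bar{g}\, p(\bar{g}). \tag{6} $$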

The optimal solution is obtained by taking a variational
derivative with respect to $p(\bar{g})$, $\delta \mathcal{L} / \delta p(\bar{g}) = 0$. The solution
is

$$ p^*(\bar{g}) = \frac{1}{Z} \frac{1}{\sigma_g(\bar{g})}. \tag{7} $$



By inserting the optimal solution, Eq (7), into the expression
for mutual information, Eq (5), we get an explicit
result for the capacity:

$$ I_{\rm opt}(c; g) = \log_2 \left( \frac{Z}{\sqrt{2\pi e}} \right), \tag{8} $$

where $Z$ is the normalization of the optimal solution in
Eq (7):

$$ Z = \int_0^1 \frac{d\bar{g}}{\sigma_g(\bar{g})}. \tag{9} $$



The optimization with respect to the distribution of
inputs, $p(c)$, has led us to the result for the optimal distribution
of mean outputs, Eq (7). We had to assume
that the input/output kernel is Gaussian and that the
noise is small, and we refer to this result as the small-noise
approximation (SNA) for channel capacity. Note
that in this approximation only the knowledge of the
noise in the output as a function of mean output, $\sigma_g(\bar{g})$,
matters for capacity computation and the direct dependence
on the input $c$ is irrelevant. This is important
because the behavior of intrinsic noise as a function of
the mean output is an experimentally accessible quantity
[16]. Note also that for big enough noise the normalization
constant $Z$ will be small compared to $\sqrt{2\pi e}$, and the
small-noise capacity approximation of Eq (8) will break
down by predicting negative information values.
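
The small-noise capacity is simple enough to evaluate numerically. The sketch below is an illustration only: the function name `sna_capacity_bits`, the midpoint-rule integration and the two constant noise levels are assumptions chosen here, not values from the text. It computes $\log_2(Z/\sqrt{2\pi e})$ with $Z = \int_0^1 d\bar{g}/\sigma_g(\bar{g})$, and shows the breakdown at large noise mentioned above.

```python
import math

def sna_capacity_bits(sigma_g, n_steps=10_000):
    """Small-noise capacity log2(Z / sqrt(2*pi*e)), Z = int_0^1 dg / sigma_g(g).

    `sigma_g` maps a normalized mean expression level in [0, 1] to the
    standard deviation of the output; the integral uses the midpoint rule.
    """
    dg = 1.0 / n_steps
    z = sum(dg / sigma_g((k + 0.5) * dg) for k in range(n_steps))
    return math.log2(z / math.sqrt(2.0 * math.pi * math.e))

# Hypothetical constant noise levels, chosen only for illustration.
capacity_small_noise = sna_capacity_bits(lambda g: 0.05)  # Z = 20
capacity_large_noise = sna_capacity_bits(lambda g: 0.5)   # Z = 2

print(capacity_small_noise)  # positive: Z exceeds sqrt(2*pi*e)
print(capacity_large_noise)  # negative: the approximation has broken down
```

With constant noise the optimal distribution of mean outputs is uniform, so the example isolates the role of the overall noise scale.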



B. Large noise approximation 

Simple regulatory elements usually have a monotonic,
saturating input/output relation, as shown in Fig 1, and
(at least) a shot noise component whose variance scales
with the mean. If the noise strength is increased, the
information transmission must drop and, even with the
optimally tuned input distribution, eventually yield only
a bit or less of capacity. Intuitively, the best such a noisy
system can do is to utilize only the lowest and highest
achievable input concentrations, and ignore the continuous
range in between. Thus, the mean responses will be
as different as possible, and the noise at low expression
will also be low because it scales with the mean. More
formally, if only $\{c_{\min}, c_{\max}\}$ are used as inputs, then
the result is either $p(g|c_{\min})$ or $p(g|c_{\max})$; the optimization
of channel capacity reduces to finding $p(c_{\min})$, with
$p(c_{\max}) = 1 - p(c_{\min})$. This problem can be solved by realizing
that each of the two possible input concentrations
produces its own Gaussian output distribution,
and by maximizing information by varying $p(c_{\min})$. Simplifying
even further, we can threshold the outputs and
allow $g$ to take on only two values instead of a continuous
range; then, each of the two possible inputs, "min"
and "max", maps into two possible outputs, "on" and
"off", and confusion in the channel arises because "min"
input might be misunderstood as "on" output and vice
versa with probabilities given by the output distribution
overlaps, as shown schematically in Fig 2.

FIG. 2: (Color online) An illustration of the large noise approximation.
We consider distributions of the output at minimal
($c_{\min}$) and full ($c_{\max}$) induction as trying to convey a single
binary decision, and construct the corresponding encoding
table (inset) by discretizing the output using the threshold
G. The capacity of such an asymmetric binary channel is degraded
from the theoretical maximum of 1 bit, because the
distributions overlap (blue and red). For unclipped Gaussians
the optimal threshold G is at the intersection of two
alternative pdfs, but in general one searches for the optimal
G that maximizes information in Eq (11).

In the latter case we can use the analytic formula for
the capacity of the binary asymmetric channel. If $\eta$ is the
probability of detecting an "off" output if "max" input
was sent, $\xi$ is the probability of receiving an "off"
output if "min" input was sent, and $H(\cdot)$ is the binary
entropy function:

$$ H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p), \tag{10} $$

then the capacity of such an asymmetric channel is [20]:

$$ I(c; g) = \frac{-\eta H(\xi) + \xi H(\eta)}{\eta - \xi} + \log_2 \left( 1 + 2^{\frac{H(\xi) - H(\eta)}{\eta - \xi}} \right). \tag{11} $$

Because this approximation reduces the continuous dis- 
tribution of outputs to only two choices, "on" or "off", 
it can underestimate the true channel capacity and is 
therefore a lower bound. 
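
The closed form for the binary asymmetric channel can be checked against a direct maximization over $p(c_{\min})$. The following Python sketch does this under stated assumptions: `eta` and `xi` denote the two "off"-output probabilities defined above, and the function names and the grid search are illustrative choices made here.

```python
import math

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def asymmetric_channel_capacity(eta, xi):
    """Capacity (bits) of a binary asymmetric channel.

    eta = P("off" output | "max" input), xi = P("off" output | "min" input).
    """
    if eta == xi:
        return 0.0  # output carries no information about the input
    h_eta, h_xi = binary_entropy(eta), binary_entropy(xi)
    return ((-eta * h_xi + xi * h_eta) / (eta - xi)
            + math.log2(1.0 + 2.0 ** ((h_xi - h_eta) / (eta - xi))))

def brute_force_capacity(eta, xi, n_grid=20001):
    """Maximize the mutual information over p(c_min) on a grid (a check)."""
    best = 0.0
    for k in range(n_grid):
        p_min = k / (n_grid - 1)
        p_off = p_min * xi + (1.0 - p_min) * eta
        info = (binary_entropy(p_off)
                - p_min * binary_entropy(xi)
                - (1.0 - p_min) * binary_entropy(eta))
        best = max(best, info)
    return best

# Hypothetical overlap probabilities, for illustration only.
print(asymmetric_channel_capacity(0.1, 0.8))
print(brute_force_capacity(0.1, 0.8))
```

For a symmetric channel ($\xi = 1 - \eta$) the expression reduces to the familiar $1 - H(\eta)$.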

C. Exact solution 



The information between the input and output in Eq
(3) can be maximized numerically for any input/output
kernel, $p(g|c)$, if the variables $c$ and $g$ are discretized,
making the solution space that needs to be searched,
$p(c_i)$, finite. One possibility is to use a gradient descent-based
method and make sure that the solution procedure
always stays within the domain boundaries $\sum_i p(c_i) = 1$,
$p(c_j) \geq 0$ for every $j$. Alternatively, a procedure
known as the Blahut-Arimoto algorithm has been derived
specifically for the purpose of finding optimal channel
capacities [21]. Both methods yield consistent solutions,
but we prefer to use the second one because of faster
convergence and convenient inclusion of constraints on
the cost of coding (see Appendix A for details).

One should be careful in interpreting the results of 
such naive optimization and worry about the artifacts in- 
troduced by discretization of input and output domains. 
After discretization, the formal optimal solution is no 
longer required to be smooth and could, in fact, be com- 
posed of a collection of Dirac-delta function spikes. On 
the other hand, the real, physical concentration c cannot 
be tuned with arbitrary precision in the cell; it is a re- 
sult of noisy gene expression, and even if this noise source 
were removed, the local concentration at the binding site 
would still be subject to fluctuations caused by randomness
in diffusive flux [22, 23]. The Blahut-Arimoto algorithm
is completely agnostic as to which (physical)
concentrations belong to which bins after concentration
has been discretized, and so it could assign wildly different
probabilities to concentration bins that differ in
concentration by less than $\sigma_c$ (i.e. the scale of local concentration
fluctuations), making such a naive solution
physically unrealizable. In Appendix A we describe how
to properly use the Blahut-Arimoto algorithm despite the
difficulties induced by discretization.
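
For orientation, here is a minimal sketch of the standard, unconstrained Blahut-Arimoto iteration for a discretized kernel. It is the naive version discussed above: it does not include the smoothing or the cost constraints described in Appendix A, and the function name, iteration limit and tolerance are assumptions made for this illustration.

```python
import math

def blahut_arimoto(kernel, n_iter=2000, tol=1e-12):
    """Capacity (bits) and optimal input distribution of a discrete channel.

    `kernel[i][j]` = p(g_j | c_i); each row must sum to 1.
    """
    n_in, n_out = len(kernel), len(kernel[0])
    p_in = [1.0 / n_in] * n_in  # start from a uniform input distribution
    lower = 0.0
    for _ in range(n_iter):
        p_out = [sum(p_in[i] * kernel[i][j] for i in range(n_in))
                 for j in range(n_out)]
        # d[i]: Kullback-Leibler divergence (bits) of p(g|c_i) from p(g)
        d = [sum(q * math.log2(q / p_out[j])
                 for j, q in enumerate(kernel[i]) if q > 0.0)
             for i in range(n_in)]
        weights = [p * 2.0 ** di for p, di in zip(p_in, d)]
        norm = sum(weights)
        # Standard lower and upper bounds on the capacity at this iterate.
        lower, upper = math.log2(norm), max(d)
        p_in = [w / norm for w in weights]  # multiplicative update
        if upper - lower < tol:
            break
    return lower, p_in

# Toy kernel: a binary channel whose first input is noiseless and whose
# second input flips to the wrong output half of the time.
capacity, p_opt = blahut_arimoto([[1.0, 0.0], [0.5, 0.5]])
print(capacity, p_opt)
```

Each iteration reweights the inputs by how distinguishable their output distributions are from the current marginal, which is why nothing prevents neighboring concentration bins from receiving very different weights.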



III. A MODEL OF SIGNALS AND NOISE 

If enough data were available, one could directly sample
$p(g|c)$ and proceed by calculating the optimal solutions
as described previously. Here we start, in contrast,
by assuming a Gaussian model of Eq (2), in which the
mean, $\bar{g}(c)$, and the output variance, $\sigma_g^2(c)$, are functions
of the transcription factor concentration, c. Our goal for 
this section is to build an effective microscopic model 
of transcriptional regulation and gene expression, and 
therefore define both functions with a small number of 
biologically interpretable parameters. In the subsequent
discussion we plan to vary those and thus systematically 
observe the changes in information capacity. 

In the simplest picture, the interaction of the TF with
the promoter site consists of binding with a (second order)
rate constant $k_+$ and unbinding at a rate $k_-$. In a
somewhat more complicated case where $h$ TF molecules
cooperatively activate the promoter, the analysis still remains
simple as long as the favorable interaction energy
between the TFs is sufficient to make only the fully occupied
(and thus activated) and the empty (and thus
inactivated) states of the promoter likely; this effective
two-state system is once more describable with a single
rate $k_-$ for switching off the promoter, and the corresponding
activation rate has to be proportional to $c^h$ (see Ref in
particular Appendix B). Generally, therefore, the equilibrium
occupancy of the site will be:

$$ n(c) = \frac{c^h}{c^h + K_d^h}, \tag{12} $$

where the Hill coefficient, $h$, captures the effects of cooperative
binding, and $K_d$ is the equilibrium constant of
binding. The mean expression level $g$ is then:



$$ \bar{g}(c) = \frac{\langle g \rangle}{g_0} = \begin{cases} n & \text{activator} \\ 1 - n & \text{repressor} \end{cases} \tag{13} $$

where $\bar{g}$ has been normalized to vary between 0 and 1,
and $g_0$ is the maximum expression level. In what follows
we will assume the activator case, where $\bar{g} = n$, and
present the result for the repressor at the end.

The fluctuations in occupancy have a (binomial) variance
$\sigma_n^2 = n(1 - n)$ and a correlation time $\tau_c = 1/(k_+ c^h + k_-)$
[23]. If the expression level of the target gene is effectively
determined by the average of the promoter site
occupancy over some window of time $\tau_{\rm int}$, then the contribution
to variance in the expression level due to the
"on-off" promoter switching will be:

$$ \sigma_n^2 \frac{\tau_c}{\tau_{\rm int}} = \frac{n(1 - n)}{(k_+ c^h + k_-)\, \tau_{\rm int}} = \frac{n(1 - n)^2}{k_- \tau_{\rm int}}, \tag{14} $$

where in the last step we use the fact that $k_+ c^h (1 - n) = k_- n$.

At low TF concentrations the arrival times of single
transcription factor molecules to the binding site are random
events. Recent measurements [24] seem to be consistent
with the hypothesis that this variability in diffusive
flux contributes an additional noise term [22, 23],
similar to the Berg-Purcell limit to chemoattractant detection
in chemotaxis. The noise in expression level due
to fluctuations in the binding site occupancy, or the total
input noise, is therefore a sum of this diffusive component
(see Eq (11) of Ref [23]) and the switching component
of Eq (14):

$$ \sigma_{\rm input}^2 = \frac{n(1 - n)^2}{k_- \tau_{\rm int}} + \frac{h^2 (1 - n)^2 n^2}{\pi D a c\, \tau_{\rm int}}, \tag{15} $$

where $D$ is the diffusion constant for the TF and $a$ is the
receptor site size, $a \sim 3$ nm for a typical binding site on
the DNA.

To compute the information capacity in the small
noise limit using the simple model developed so far we
need the constant $Z$ from Eq (9), which is defined as
an integral over expression levels. As both input noise
terms are proportional to $(1 - \bar{g})^2$, the integral must take
the form:

$$ Z \propto \int_0^1 \frac{d\bar{g}}{(1 - \bar{g})\, F(\bar{g})}, \tag{16} $$

where $F(\bar{g})$ is a function that approaches a constant as
$\bar{g} \to 1$. Strangely, we see that this integral diverges near
full induction ($\bar{g} = 1$), which means that the information
capacity also diverges.

Naively we expect that modulations in transcription 
factor concentration are not especially effective at trans- 
mitting regulatory information once the relevant binding 
sites are close to complete occupancy. More quantitatively,
the sensitivity of the site occupancy to changes in
TF concentration, $dn/dc$, vanishes as $n \to 1$, and hence
small changes in TF concentration will have vanishingly
small effects. Our intuition breaks down, however, because
in thinking only about the mean occupancy we
forget that even very small changes in occupancy could
be effective if the noise level is sufficiently small. As
we approach complete saturation, the variance in occupancy
decreases, and the correlation time of fluctuations
becomes shorter and shorter; together these effects cause
the standard deviation as seen through an averaging time
$\tau_{\rm int}$ to decrease faster than $dn/dc$, and this mismatch
is the origin of the divergence in information capacity.
Of course the information capacity of a physical system
can't really be infinite; there must be an extra source of
noise (or reduced sensitivity) that becomes limiting as
$n \to 1$.
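
The divergence can be made explicit in a simplified case. The sketch below assumes, purely for illustration, that promoter switching is the only noise source, so that for an activator the variance is $\gamma\, \bar{g} (1 - \bar{g})^2$ with $\gamma = 1/(k_- \tau_{\rm int})$ (the switching term of Eq (14) with $\bar{g} = n$). Cutting the integral for $Z$ off at $\bar{g} = 1 - \epsilon$ then gives the closed form $Z(\epsilon) = 2\, {\rm artanh}(\sqrt{1 - \epsilon})/\sqrt{\gamma}$, which grows without bound as $\epsilon \to 0$. The function names, the cutoff $\epsilon$ and the value of $\gamma$ are hypothetical.

```python
import math

def z_closed_form(eps, gamma):
    """Z for switching noise only, integrated from 0 to 1 - eps."""
    return 2.0 * math.atanh(math.sqrt(1.0 - eps)) / math.sqrt(gamma)

def z_numeric(eps, gamma, n_steps=200_000):
    """Same integral by the midpoint rule after substituting u = sqrt(g).

    With sigma_g = sqrt(gamma * g) * (1 - g), the integrand dg / sigma_g
    becomes 2 du / (sqrt(gamma) * (1 - u**2)).
    """
    u_max = math.sqrt(1.0 - eps)
    du = u_max / n_steps
    total = sum(du / (1.0 - ((k + 0.5) * du) ** 2) for k in range(n_steps))
    return 2.0 * total / math.sqrt(gamma)

# Z keeps growing (logarithmically) as the cutoff approaches full induction.
for eps in (1e-1, 1e-3, 1e-6):
    print(eps, z_closed_form(eps, gamma=0.01))
```

The growth is only logarithmic in $1/\epsilon$, so even a modest additional noise source near saturation is enough to keep the capacity finite.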



The noise in Eq (15) captures only the input noise,
i.e. the noise in the protein level caused by the fluctuations
in the occupancy of the binding site. In contrast,
the output noise arises even when the occupancy of the
binding site is fixed (for example, at full induction), and
originates in the stochasticity in transcription and translation.
The simplest model postulates that when the activator
binding site is occupied with fractional occupancy
$n$, mRNA molecules are synthesized in a Poisson process
at a rate $R_e$ that generates $R_e \tau_e n$ mRNA molecules on
average during the lifetime of a single mRNA molecule,
$\tau_e$. Every message is a template for the production of
proteins, which is another Poisson process with rate $R_g$.
Parameter   Value                                 Description
$\alpha$    $(1 + b)/g_0$                         Output noise strength
$\beta$     $h^2/(\pi D a K_d \tau_{\rm int})$    Diffusion input noise strength
$\gamma$    $(k_- \tau_{\rm int})^{-1}$           Switching input noise strength
$h$                                               Cooperativity (Hill coefficient)

TABLE I: Gaussian noise model parameters. Note that if the
burst size $b \gg 1$, then the output noise is determined by
the average number of mRNA molecules, $\alpha \approx \langle {\rm mRNA} \rangle^{-1}$.
Note further that if the on-rate is diffusion limited, i.e.
$k_+ = 4\pi D a$, then both input noise magnitudes, $\beta$ and $\gamma$,
are proportional and decrease with increasing $k_-$, or alternatively,
with increasing $K_d = k_-/k_+$.



If the integration time is larger than the lifetime of single
mRNA molecules, $\tau_{\rm int} \gg \tau_e$, the mean number of
proteins produced is $g = R_g \tau_{\rm int} R_e \tau_e n = g_0 n$, and the
variance associated with both Poisson processes is [23]:

$$ \sigma_{\rm output}^2 = \frac{1 + R_g \tau_e}{g_0}\, n, \tag{17} $$

where $b = R_g \tau_e$ is the burst size, or the number of proteins
synthesized per mRNA.

We can finally put the results together by adding the
input noise Eq (15) and the output noise Eq (17), and
expressing both in terms of the normalized expression
level $\bar{g}(c)$:

$$ \sigma_g^2({\rm act}) = \alpha \bar{g} + \beta (1 - \bar{g})^{2 + 1/h}\, \bar{g}^{2 - 1/h} + \gamma\, \bar{g} (1 - \bar{g})^2 \tag{18} $$

$$ \sigma_g^2({\rm rep}) = \alpha \bar{g} + \beta (1 - \bar{g})^{2 - 1/h}\, \bar{g}^{2 + 1/h} + \gamma\, \bar{g}^2 (1 - \bar{g}) \tag{19} $$

with the relevant parameters $\{\alpha, \beta, \gamma, h\}$ explained in Table I.
Note that the repressor and activator cases differ
only in the shape of the input noise contributions (especially
for low cooperativity $h$). Note further that the
output noise increases monotonically with mean expression
$\bar{g}$, while the input noise peaks at the intermediate
levels of expression. To make the examination of the parameter
space in the next section feasible, we set $\gamma = 0$;
models with switching noise instead of diffusive noise
produce qualitatively similar results.
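
The noise model can be written as a small function, which makes its limiting cases easy to check. This sketch assumes the reading of Eqs (18) and (19) in which the activator variance is $\alpha \bar{g} + \beta (1 - \bar{g})^{2 + 1/h} \bar{g}^{2 - 1/h} + \gamma \bar{g} (1 - \bar{g})^2$ and the repressor variance has the roles of $\bar{g}$ and $1 - \bar{g}$ exchanged in the input noise terms; the function name and the parameter values are illustrative only.

```python
def noise_variance(g, alpha, beta, gamma, h, repressor=False):
    """Variance of the normalized output at mean expression level g in (0, 1).

    alpha: output noise strength, beta: diffusion input noise strength,
    gamma: switching input noise strength, h: Hill coefficient.
    """
    if repressor:
        diffusion = beta * (1.0 - g) ** (2.0 - 1.0 / h) * g ** (2.0 + 1.0 / h)
        switching = gamma * g ** 2 * (1.0 - g)
    else:
        diffusion = beta * (1.0 - g) ** (2.0 + 1.0 / h) * g ** (2.0 - 1.0 / h)
        switching = gamma * g * (1.0 - g) ** 2
    return alpha * g + diffusion + switching

# Hypothetical parameter values, for illustration only.
alpha, beta = 0.01, 0.02
for g in (0.1, 0.5, 0.9):
    act = noise_variance(g, alpha, beta, 0.0, h=1)
    rep = noise_variance(g, alpha, beta, 0.0, h=1, repressor=True)
    print(g, act, rep)
```

For $h = 1$ and $\gamma = 0$ the activator branch reduces to $\alpha \bar{g} + \beta (1 - \bar{g})^3 \bar{g}$, and for very large $h$ the two branches coincide when $\gamma = 0$.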



IV. RESULTS 
A. Capacity of simple regulatory elements 

Having at our disposal both a simple model of signals 
and noise and a numerical way of finding the optimal 
solutions given an arbitrary input/output kernel, we are
now ready to examine the channel capacity as a function
of the noise parameters from Table I. Our first result,
shown in Fig 3, concerns the simplest case of an activator
with no cooperativity, $h = 1$; for this case, the noise in
Eq (18) simplifies to:

$$ \sigma_g^2 = \alpha \bar{g} + \beta (1 - \bar{g})^3\, \bar{g}. \tag{20} $$

Here we have assumed that there are two relevant sources
of noise, i.e. the output noise (which we parametrize by
$\alpha$ and plot on the horizontal axis) and the input diffusion
noise (parametrized by $\beta$ and plotted on the vertical axis). Each point of
the noise plane in Fig 3 therefore represents a system
characterized by a Gaussian noise model, Eq (2), with
variance given by Eq (20) above.



As expected, the capacity increases most rapidly when 
the origin of the noise plane is approached approximately 
along its diagonal, whereas along each of the edges one 
of the two noise sources effectively disappears, leaving 
the system dominated by either output or input noise 
alone. We pick two illustrative examples, the blue and 
the red systems of Figs 3B and 3C, that have realistic
noise parameters. The blue system has, apart from the
decreased cooperativity ($h = 1$ instead of $h = 5$), the
characteristics of the Bicoid-Hunchback regulatory element
in Drosophila melanogaster [23, 25]; the red system
is dominated by output noise with characteristics
measured recently for about 40 yeast genes [2T. We 
would like to emphasize that both the small-noise ap- 
proximation and the exact solution predict that these 
realistic systems are capable of transmitting more than 
1 bit of regulatory information and that they, indeed, 
could transmit up to about 2 bits. In addition, we are 
also reminded that while the distributions (for example 
the optimal output distribution in Fig 3B) can look bimodal
and this has often been taken as an indication
that there are two relevant states of the output, such
distributions really can have capacities above 1 bit; similarly,
distributions without prominent features, such as the
monotonically decreasing optimal output distribution of
Fig 3C, should also not necessarily be expected to have
low capacities.

A closer look at the overall agreement between the 
small-noise approximation (dashed lines in Fig 3A) and
the exact solution (thick lines) shows that the small-noise
approximation underestimates the true capacity,
consistent with our remark that for large noise the approximation
will incorrectly produce negative results; at
the 2-bit information contour the approximation is about
15% off but improves as the capacity is increased.

In the high noise regime we are making yet another
approximation, the validity of which we now need to examine.
In our discussion about the models of signals and
noise we assumed that we can talk about the fractional
occupancy of the binding site and the continuous concentrations
of mRNA, transcription factors and protein,
instead of counting these species in discrete units, and
that noise can effectively be treated as Gaussian. Both of
these assumptions are the cornerstones of the Langevin
approximation for calculating the noise variance [26]. If
parameters $\alpha$ and $\beta$ actually arise due to the underlying
microscopic mechanisms described in Section III
on signals and noise, we expect that at least for some
large-noise regions of the noise plane the discreteness
in the number of mRNA molecules will become important
and the Langevin approximation will fail. In such
cases a (much more time-consuming) exact calculation
of the input/output relations using the Master equation
is possible for some noise models (see Appendix B); we
show that in the region where $\log \alpha > -2$ the channel
capacities calculated with Gaussian kernels can be
overestimated by ~10% or more; there the Langevin
calculation gives the correct second moment, but misses
the true shape of the distribution. Although both examples
with realistic noise parameters, B and C of Fig 3,
lie safely in the region where the Langevin approximation is
valid, care should be used whenever both the output noise
$\alpha$ and the burst size are large, and $\alpha$ is consequently dominated
by a small number of transcripts.

FIG. 3: (Color online) Information capacity (color code, in
bits) as a function of input and output noise using the activator
input/output relation with Gaussian noise given by Eq
(20) and no cooperativity ($h = 1$). Panel A shows the exact
capacity calculation (thick line) and the small noise approximation
(dashed line). Panel B displays the details of the blue
point in A: the noise in the output is shown as a function of
the input, with a peak being characteristic of a dominant input
noise contribution; also shown is the exact solution (thick
black line) and the small-noise approximation (dashed black
line) to the optimal distribution of output expression levels.
Panel C similarly displays details of the system denoted by
a red dot in A; here the output noise is dominant and both
approximate and exact solutions for the optimal distribution
of outputs show a trend monotonically decreasing with the
mean output.

FIG. 4: Difference in the information capacity between the
repressors and activators (color code in bits). Panel A shows
$I_{\rm rep}(h = 1) - I_{\rm act}(h = 1)$, with the noise model that includes
output ($\alpha$) and input diffusion noise ($\beta$) contributions (see
Fig 3 for absolute values of $I_{\rm act}(h = 1)$). Panel B shows
$I_{\rm rep} - I_{\rm act}$ for the noise model that includes output noise
($\alpha$) and input switching noise ($\gamma$) contributions; this plot is
independent of cooperativity, $h$.

Is there any difference between activators and repressors
in their capacity to convey information about the
input? We concluded Section III on the noise models
with separate expressions for activator noise, Eq (18),
and repressor noise, Eq (19); focusing now on the
repressor case, we recompute the information in the same
manner as we did for the activator in Fig 3A, and display
the difference between the capacities of the repressor and
activator with the same noise parameters in Fig 4. As
expected, the biggest difference occurs above the main
diagonal, where the input noise dominates over the output
noise. In this region the capacity of the repressor
can be bigger by as much as a third than that of the
corresponding activator. Note that as h → ∞, the activator
and repressor noise expressions become indistinguishable
and the difference in capacity vanishes for the noise models
with output and input diffusion noise contributions,
Eqs (18, 19).



The behavior of the regulatory element is conveniently 
visualized in Fig 5 by plotting a cut through the noise
plane along its main diagonal. Moving along this cut 
scales the total noise variance of the system up or down 
by a multiplicative factor, and allows us to observe the 
overall agreement between the exact solution and small- 
and large-noise approximations. In addition we point 
FIG. 5: (Color online) Comparison of exact channel capacities
and various approximate solutions. For both panels
(panel A, no cooperativity, h = 1; panel B, strong
cooperativity, h = 3) we take a cross-section through the noise
plane in Fig 3 along the main diagonal, where the values
for noise strength parameters α and β are equal. The exact
optimal solution is shown in red. By moving along the
diagonal of the noise plane (and along the horizontal axis in
the plots above) one changes both input and output noise
by the same multiplicative factor s, and since, in the
small-noise approximation, I_SNA ∝ log Z with Z = ∫ σ_g(g)^{-1} dg,
that factor results in an additive change in capacity by log2 s.
We can use the large noise approximation lower bound on capacity
for the case h = 1, in the parameter region where capacities
fall below 1 bit.



out the following interesting features of Fig 5 that will
be examined more closely in subsequent sections.

First, the parameter region in the non-cooperative case
in which the capacity falls below one bit and the large
noise approximation is applicable is small, and shrinks
further at higher cooperativity. This suggests that a
biological implementation of a reliable binary channel could
be relatively straightforward, assuming our noise models
are appropriate. Moreover, there exist distributions not
specifically optimized for the input/output kernel, such
as the input distribution uniform in log(c/K_d) that we
pick as an illustrative example in Fig 5 (thick black line);
we find that this simple choice can achieve considerable
information transmission, and are therefore motivated
to raise a more general question about the sensitivity
of channel capacity with respect to perturbations in the
optimal solution, p*(c). We revisit this idea more
systematically in the next section.

Second, it can be seen from Fig 5 that at small noise
the cooperativity has a minor effect on the channel
capacity. This is perhaps unexpected as the shape of the
mean response g(c) strongly depends on h. We recall,
however, that mutual information I(c; g) is invariant to
any invertible reparametrization of either g or c. In
particular, changing the cooperativity or the value of
the equilibrium binding constant, K_d, in theory only
results in an invertible change in the input variable c,
and therefore the change in the steepness or midpoint of
the mean response must not have any effect on I(c; g).
This argument does break down in the high noise regime,
where the cooperative system achieves capacities above
one bit while the non-cooperative system fails to do so.
Reparametrization invariance would work only if the input
concentration could extend over the whole positive
interval, from zero to infinity. The substantial difference
between capacities of cooperative and non-cooperative
systems in Fig 5 at low capacity stems from the fact
that in reality the cell (and our computation) is limited
to a finite range of concentrations, c ∈ [c_min, c_max],
instead of the whole positive half-axis, c ∈ [0, ∞). We
explore the issue of limited input dynamic range further
in the next section.

Finally, we draw attention to the simple linear scaling
of the channel capacity with the logarithm of the
total noise strength in the small noise approximation, as
explained in the caption of Fig 5. In general, increasing
the number of input and output molecules by a factor of
four will decrease the relative input and output noise by
a factor of √4 = 2, and therefore, in the small noise
approximation, increase the capacity by log2 2 = 1 bit. If
one assumes that the cell can make transcription factor
and output protein molecules at no cost, then scaling of
the noise variance along the horizontal axis of Fig 5 is
inversely proportional to the total number of signaling
molecules used by the regulatory element, and its capacity
can grow without bound as more and more signaling
molecules are used. If, however, there are metabolic or
time costs to making more molecules, our optimization
needs to be modified appropriately, and we present the
relevant computation in Section IV C on the costs of coding.
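The one-bit gain quoted above can be checked numerically. The following is a minimal sketch, not the computation used for the figures: it assumes the small-noise capacity takes the form I = log2(Z / sqrt(2πe)) with Z = ∫ dg / σ_g(g), and it uses an illustrative activator noise model (an output term linear in g plus an input diffusion term) as a stand-in for Eq (18). The function names, grid size and parameter values are hypothetical.

```python
import numpy as np

def noise_std(g, alpha, beta, h=1.0):
    """Assumed activator noise model: output term plus input diffusion term."""
    var = alpha * g + beta * g ** (2.0 - 1.0 / h) * (1.0 - g) ** (2.0 + 1.0 / h)
    return np.sqrt(var)

def sna_capacity_bits(alpha, beta, h=1.0, n=200_000):
    """Small-noise estimate I = log2(Z / sqrt(2*pi*e)), Z = int_0^1 dg / sigma_g(g)."""
    g = (np.arange(n) + 0.5) / n  # midpoint rule, never evaluates g = 0 or g = 1
    Z = np.mean(1.0 / noise_std(g, alpha, beta, h))
    return np.log2(Z / np.sqrt(2.0 * np.pi * np.e))

# Four times more molecules: both noise variances drop by a factor of four.
base = sna_capacity_bits(alpha=1e-3, beta=1e-3)
more_molecules = sna_capacity_bits(alpha=1e-3 / 4, beta=1e-3 / 4)
gain_bits = more_molecules - base  # expected to be one bit in this approximation
print(f"capacity gain: {gain_bits:.6f} bits")
```

Because the noise standard deviation enters Z only through an overall factor, the gain is independent of the particular noise model chosen here; only the absolute capacities depend on it.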



B. Cooperativity, dynamic range and the tuning of 
solutions 



In the analysis presented so far we have not paid any 
particular attention to the question of whether the opti- 
mal input distributions are biologically realizable or not. 
We will proceed to relax some of the idealizations made 
until now and analyze the corresponding changes in the 
information capacity. 

We start by considering the impact on channel capacity
of changing the allowed dynamic range to which
the input concentration is restricted. Figure 6A displays
the capacity as a function of the dynamic range, output
noise and cooperativity. The main feature of the plot
is the difference between the low and high cooperativity
cases at each noise level; regardless of cooperativity the
total information at infinite dynamic range would saturate
at approximately the same value (which depends
on the output noise magnitude). However, highly cooperative
systems manage to reach a high fraction of 80%
or more of their saturated information capacity even at
reasonable dynamic ranges of 25 to 100-fold (meaning
that the input concentration varies between [K_d/5, 5K_d]
or [K_d/10, 10K_d], respectively), whereas low cooperativity
systems require a much bigger dynamic range for the
same effect. The decrease in capacity with decreasing
FIG. 6: (Color online) Effects of imposing realistic
constraints on the space of allowed input distributions. Panel
A shows the change in capacity if the dynamic range of
the input around K_d is changed ("25-fold range" means
c ∈ [K_d/5, 5K_d]). The regulatory element is a repressor with
either no cooperativity (dashed line) or high cooperativity,
h = 3 (thick line). We plot three high-low cooperativity
pairs for different choices of the output noise magnitude (high
noise in light gray, log α ≈ −2.5; medium noise in dark gray,
log α ≈ −5; low noise in black, log α ≈ −7.5). Panel B shows
the sensitivity of channel capacity to perturbations in the
optimal input distribution. For various systems from Fig 3
we construct suboptimal input distributions, as described in
the text, compute the fraction of capacity lost relative to the
unperturbed optimal solution and plot this fraction against
the optimal capacity of that system (black dots); the
extrapolated absolute capacity left when the input tends to be
very different from optimal, i.e. D_JS → 1, is plotted in red.



dynamic range is a direct consequence of the nonlinear
relationship between the concentration and occupancy,
Eq (12), and for low cooperativity systems means being
unable to fully shut down or fully induce the promoter.
In theory, Eq (18) predicts that σ_g(g(c) → 0) = 0, making
the state in which the gene is "off" very informative
about the input. If, however, the gene cannot be fully
repressed either because there is always some residual
input, c_min, or because there is leaky expression even
when the input is exactly zero, then at any biologically
reasonable input dynamic range some capacity will be
lost.
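To make the effect of a finite dynamic range concrete, the sketch below evaluates how far the mean expression can be pushed towards "off" and "on" when the input is confined to a symmetric fold-range around K_d. It assumes a Hill-type mean response g(c) = c^h / (1 + c^h), with c in units of K_d, as a stand-in for Eq (12); the activator form is used for illustration (a repressor is its mirror image), and the function names are hypothetical.

```python
def hill_activator(c, h):
    """Mean activator response for input c measured in units of K_d (assumed Hill form)."""
    return c ** h / (1.0 + c ** h)

def expression_window(fold_range, h):
    """Lowest and highest mean expression for c in [K_d/sqrt(F), sqrt(F)*K_d]."""
    half = fold_range ** 0.5
    return hill_activator(1.0 / half, h), hill_activator(half, h)

# 25-fold range, i.e. c between K_d/5 and 5*K_d
low1, high1 = expression_window(25.0, h=1)  # no cooperativity
low3, high3 = expression_window(25.0, h=3)  # strong cooperativity
print(f"h=1: g in [{low1:.3f}, {high1:.3f}]")
print(f"h=3: g in [{low3:.3f}, {high3:.3f}]")
```

Under these assumptions the non-cooperative element only spans roughly the middle two thirds of its expression range, while the h = 3 element comes within about one percent of full repression and full induction over the same 25-fold input range.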

Next, we briefly discuss how precisely tuned the resulting
optimal distributions have to be to take full advantage
of the regulatory element's capacity. For each point
in the noise plane of Fig 3 the optimal input distribution
p*(c) is perturbed many times to create an ensemble
of suboptimal inputs p_i(c) (see Appendix C). For each
p_i(c) we compute, first, its distance away from the
optimal solution by means of the Jensen-Shannon divergence,
d_i = D_JS(p_i, p*) [29]; next, we use p_i(c) to compute
the suboptimal information transmission I_i. The divergence
d_i is a measure of similarity between two distributions
and ranges between 0 (distributions are the same)
and 1 (distributions are very different); 1/d_i(p_i, p*)
approximately corresponds to the number of samples one
would have to draw to say with confidence that they were
selected either from p_i or p*. A scatter plot of many
such pairs (d_i, I_i) obtained with various perturbations
p_i(c) for each system of the noise plane characterizes the
sensitivity of the optimal solution for that system; the
main feature of such a plot, Fig 9, is the linear (negative)
slope that describes the fraction of channel capacity lost
for a unit of Jensen-Shannon distance away from the
optimal solution. Figure 6B displays these fractions as a
function of the optimal capacity, and each system from
the noise plane shown in Fig 3 is represented by a black
dot. We note that systems with higher capacities require
more finely tuned solutions and suffer a larger fractional
(and thus absolute) loss if the optimal input distribution
is perturbed. Importantly, if the linear slopes are taken
seriously and are used to extrapolate towards distributions
that are very different from optimal, D_JS → 1, we
observe that for most of the noise plane the leftover
capacity still remains about a bit, indicating that biological
regulatory elements capable of transmitting an "on-off"
decision perhaps are not difficult to construct. On the
other hand, transmitting significantly more than one bit
requires some degree of tuning that matches the distribution
of inputs to the characteristics of the regulatory
element.
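The distance measure used in this analysis is easy to compute on a discrete grid. The sketch below is an illustration under stated assumptions rather than the code behind Fig 6B: it uses the standard definition D_JS(p, q) = ½ D_KL(p‖m) + ½ D_KL(q‖m) with m = (p + q)/2 and base-2 logarithms, which gives the range from 0 to 1 quoted in the text. The function names are hypothetical.

```python
import numpy as np

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p = 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence_bits(p, q):
    """Jensen-Shannon divergence between two discrete distributions, in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # m > 0 wherever p > 0 or q > 0, so the ratios are finite
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

same = js_divergence_bits([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])      # identical inputs
disjoint = js_divergence_bits([1.0, 0.0], [0.0, 1.0])            # non-overlapping inputs
print(same, disjoint)
```

Identical distributions give zero and distributions with disjoint support give one bit, which are the two limits referred to above.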



C. Costs of higher capacity 

Real regulatory elements must balance the pressure to 
convey information reliably with the cost of maintaining 
the cell's internal state, represented by the expression 
levels of transcription factors. The fidelity of the rep- 
resentation is increased (and the fractional fluctuation 
in their number is decreased) by having more molecules 
"encode" a given state. On the other hand, making or 
degrading more transcription factors puts a metabolic 
burden on the cell, and frequent transitions between var- 
ious regulatory states could involve large time lags as, for 
example, the regulation machinery attempts to keep up 
with a changed environmental condition, by accumulat- 
ing or degrading the corresponding TF molecules. In 
addition, the output genes themselves that get switched 
on or off by transcription factors and therefore "read 
out" the internal state must not be too noisy, otherwise 
the advantage of maintaining precise transcription factor 
levels is lost. 

Suppose that there is a cost to the cell for each 
molecule of output gene that it needs to produce, and 
that this incremental cost per molecule is independent 
of the number of molecules already present. Then, 
on the output side, the cost must be proportional to 
⟨g⟩ = ∫ dg g p(g). We recall that in the optimal
distribution calculations g is expressed relative to the
maximal expression, such that its mean is between zero and
one. To get an absolute cost in terms of the number of
molecules, this normalized g therefore needs to be multiplied
by the inverse of the output noise strength, α⁻¹, as
the latter scales with g_0 (see Table I). The contribution
of the output cost is thus ∝ α⁻¹⟨g⟩.
On the input side, the situation is similar: the cost
must be proportional to K_d⟨c̃⟩ = K_d ∫ dc̃ c̃ p(c̃), where
our optimal solutions are expressed, as usual, in
dimensionless concentration units, c̃ = c/K_d. In either of the
two input noise models (i.e. diffusion or switching input
noise), with the diffusion constant held fixed, K_d ∝ β⁻¹ or
K_d ∝ γ⁻¹. See Appendix D for notes on the effects
of non-specific binding of transcription factors to
the DNA.

Collecting all our thoughts on the costs of coding, we
can write down the "cost functional" as the sum of input
and output cost contributions:

$$\langle C[p(c)] \rangle = \frac{v_1}{\beta} \int dc\, p(c)\, c + \frac{v_2}{\alpha} \int dc\, p(c) \int dg\, p(g|c)\, g, \tag{21}$$

where v_1 and v_2 are proportional to the unknown costs
per molecule of input or output, respectively, and α and
β are noise parameters of Table I. This ansatz captures
the intuition that while decreasing noise strengths will
increase information transmission, it will also increase
the cost. Instead of maximizing the information without
regard to the cost, the new problem to extremize is:

$$\mathcal{L}[p(c)] = I[p(c)] - \Phi \langle C[p(c)] \rangle - \Lambda \int dc\, p(c), \tag{22}$$

and the Lagrange multiplier Φ has to be chosen so
that the cost of the resulting optimal solution ⟨C[p*(c)]⟩
equals some predefined cost C_0 that the cell is prepared
to pay.
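On a discrete grid the cost functional of Eq (21) reduces to two weighted sums. The following is a minimal sketch under the assumption that the kernel is stored as a matrix with entries p(g_j | c_i); the function name, the array layout and the toy numbers are hypothetical and serve only to show how the 1/β and 1/α prefactors enter.

```python
import numpy as np

def mean_cost(p_c, c, kernel, g, alpha, beta, v1=1.0, v2=1.0):
    """Discrete analogue of Eq (21).

    p_c    : input distribution over the grid c (dimensionless, in units of K_d)
    kernel : kernel[i, j] = p(g_j | c_i), each row sums to one
    g      : output grid, expression relative to maximal induction
    """
    input_cost = (v1 / beta) * np.sum(p_c * c)
    mean_g_given_c = kernel @ g                      # sum_g p(g|c) g for every c
    output_cost = (v2 / alpha) * np.sum(p_c * mean_g_given_c)
    return input_cost + output_cost

# Toy two-state example: noiseless switch used half of the time.
c_grid = np.array([0.0, 1.0])
g_grid = np.array([0.0, 1.0])
toy_kernel = np.eye(2)
p_in = np.array([0.5, 0.5])
example_cost = mean_cost(p_in, c_grid, toy_kernel, g_grid, alpha=0.01, beta=0.01)
print(example_cost)
```

Halving both noise strengths in this toy example doubles the cost, which is the trade-off between lower noise and higher expense that the ansatz is meant to capture.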

We now wish to recreate the noise plane of Fig 3 while
constraining the total cost of each solution to C_0. To be
concrete and pick the value for the cost and proportionality
constants in Eq (21), we use the estimates from
Drosophila noise measurements and analysis in [23, 24],
which assign to the system denoted by a blue dot in
Fig 3A the values of ~800 Bicoid molecules of input
at K_d, and a maximal induction of g_0 ~ 4000 Hunchback
molecules if the burst size b were 10. Figure 7A
is the noise plane for an activator with no cooperativity,
as in Fig 3, but with the cost limited to an average
total of C_0 ~ 7000 molecules of input and output per
nucleus. There is now one optimal solution denoted by a
green dot (with a dominant input noise contribution); if
one tries to choose a system with lower input or output
noise, the cost constraint forces the input distribution,
p(c), and the output distribution, p(g), to have very low
probabilities at high induction, consequently limiting the
capacity.

Clearly, a different system will be optimal if another
total allowed cost C_0 is selected. The dark green line on
the noise plane in Fig 7A corresponds to the flow of the
optimal solution for an activator with no cooperativity
if the allowed cost is increased, and the corresponding
cost-capacity curve is shown in Fig 7B. The light green
line is the trajectory of the optimal solution in the noise
plane of the activator system with cooperativity h = 3,
and the dark and light red trajectories are shown for
FIG. 7: (Color online) The effects of metabolic or time costs
on the achievable capacity of simple regulatory elements.
Contours in panel A show the noise plane for the non-cooperative
activator from Fig 3 with the imposed constraint that the
average total (input + output) cost is fixed to some C_0; as
the cost is increased, the optimal solution (green dot) moves
along the arrows on a dark green line (the contours change
correspondingly, not shown). The light green line shows the
activator with cooperativity h = 3; dark and light red lines show
repressors without and with cooperativity (h = 3). Panel B
shows the achievable capacity as a function of cost for each
line in panel A.



the repressor with h = 1 and h = 3, respectively. We
note first that the behavior of the cost function is quite
different for the activator (where low input implies low
output and therefore low cost; and conversely high input
means high output and also high cost) and the repressor
(where input and output are mutually exclusively high
or low and the cost is intermediate in both cases).
Secondly, in Fig 7B we observe that the optimal capacity
as a function of cost is similar for the activators and
repressors, in contrast to the comparison of Fig 4, where
repressors provided higher capacities. Thirdly, we note
in the same figure that increasing the cooperativity at
fixed noise strength β brings a substantial increase, of
almost a bit over the whole cost range, in the channel
capacity, in agreement with our previous observations
about the interaction between capacity and the dynamic
range. The last and perhaps the most significant conclusion
is that even with input distributions matched to
maximize the transmission at a fixed cost, the capacity
still only scales roughly linearly with the logarithm
of the number of available signaling molecules, and this
fact must ultimately be limiting in a single regulatory
element.



V. DISCUSSION 

We have tried to analyze a simple regulatory element 
as an information processing device. One of our major 
results is that one cannot discuss an element in isola- 
tion from the statistics of the input that it is exposed 
to. Yet in cells the inputs are often transcription factor 
concentrations that "encode" the state of various genetic 
switches, from those responsible for cellular identity to 
those that control the rates of metabolism and cell di- 
vision, and the cell exerts control over these concentra- 
tions. While it could use different distributions to rep- 
resent various regulatory settings, we argue that the cell 
should use the one distribution that allows it to make 
the most of its genetic circuitry - the distribution that 
maximizes the dependency, or mutual information, be- 
tween inputs and outputs. Mutual information can then 
be seen both as a measure of how well the cell is doing 
by using its encoding scheme, and the best it could have 
done using the optimal scheme, which we can compute; 
comparison between the optimal and measured distribu- 
tions gives us a sense of how close the organism is to the 
achievable bound [25 . Moreover, mutual information 
has absolute units, i.e. bits, that have a clear interpre- 
tation in terms of the number of discrete distinguishable 
states that the regulatory element can resolve. This last 
fact helps clarify the ongoing debates about what is the 
proper noise measure for genetic circuits, and in what 
context a certain noise is either "big" or "small" (as it is 
really a function of the inputs). Information does not
replace the standard noise-over-the-mean measure - noise
calculations or measurements are still necessary to com- 
pute the element's capacity - but does give it a functional 
interpretation. 

We have considered a class of simple parametrizations 
of signals and noise that can be used to fit measurements 
for several model systems, such as Bicoid-Hunchback in 
the fruit fly, a number of yeast genes, and the lac
repressor in Escherichia coli (see Ref [25] for the latter).
We find that the capacities of these realistic elements 
are generally larger than 1 bit, and can be as high as 
2 bits. By simple inspection of optimal output distri- 
butions in Figs or [S]^ it is difficult to say anything 
about the capacity: the distribution might look bimodal 
yet carry more than one bit, or might even be a mono- 
tonic function without any obvious structure, indicating 
that the information is encoded in the graded response 
of the element. When the noise is sufficiently high, on 
the other hand, the optimal strategy is that of achieving 



one bit of capacity and only utilizing maximum and min- 
imum available levels of transcription factors for signal- 
ing. The set of distributions that achieve capacities close 
to the optimal one is large, suggesting that perhaps one- 
bit switches are not difficult to implement biologically, 
while in contrast we find that transmission of much more 
than one bit requires some tuning of the system. 

Finally, we discussed how additional biophysical con- 
straints can modify the channel capacity. By assuming 
a linear cost model for signaling molecules and a limited 
input dynamic range, the capacity and cost couple in an 
interesting way and the maximization principle allows 
new questions to be asked. For example, increasing the 
cooperativity reduces the cost, as we have shown; on the 
other hand, it increases the sensitivity to fluctuations in
the input, because the input noise strength β is proportional
to the square of the Hill coefficient, h. In a given
system we could therefore predict the optimal effective 
cooperativity, if we knew the real cost per molecule. Fur- 
ther work is needed to tease out the consequences of cost 
(if any) from experimental data. 

The principle of information maximization clearly is 
not the only possible lens through which regulatory net- 
works are to be viewed. One can think of examples where 
there are constraints on the dynamics, something that
our analysis has ignored by only looking at steady state
behavior; for instance, the chemotactic network of
Escherichia coli has to perfectly adapt in order for the
bacterium to be able to climb the attractant gradients.
Alternatively, suppose that a system needs to convey only
a single bit, but it has to be done reliably in a fluctuating 
environment, perhaps by being robust to the changes in 
outside temperature. In this case it seems that both con- 
cepts, that of maximal information transmission and the 
robustness to fluctuations in certain auxiliary variables 
which also influence the noise, could be included into 
the same framework, but the issue needs further work. 
More generally, however, these and similar examples
assume that one has identified in advance the biologically
relevant features of the system, e.g. perfect adaptation
or robustness, and that there exists a problem-specific
error measure which the regulatory network is trying to
minimize. Such a measure could then either replace or 
complement the assumption-free information theoretic 
approach presented here. 

We emphasize that the kind of analysis carried out 
here is not restricted to a single regulatory element. As 
was pointed out in the introduction, the inputs X and 
the outputs O of the regulatory module can be multi- 
dimensional, and the module could implement complex 
internal logic with multiple feedback loops. It seems 
that especially in such cases, when our intuition about 
the noise - now a function of multiple variables - starts 
breaking down, the information formalism could prove 
to be helpful. Although the solution space that needs 
to be searched in the optimization problem grows ex- 
ponentially in the inputs, there are biologically relevant 
situations that nevertheless appear tractable: for example,
when there are multiple readouts of the same input,
or combinatorial regulation of a single output by a pair 
of inputs; in addition, knowing that the capacities of 
a single input/output chain are on the order of a few 
bits also means that only a small number of distinct in- 
put levels for each input need to be considered. Some 
cases of interest therefore appear immediately amenable 
to biophysical modeling approaches and the computation 
of channel capacities, as presented in this paper. 

We have focused here on the theoretical exploration of 
information capacity in simple models. It is natural to 
ask how our results relate to experiment. Perhaps the 
strongest connection would be if biological systems re- 
ally were selected by evolution to optimize information 
flow in the sense we have discussed. If this optimization 
principle applies to real regulatory elements, then, for 
example, given measurements on the input/output relation
and noise in the system we can make parameter-free
predictions for the distribution of expression levels that 
cells will use. Initial efforts in this direction, using the 
Bicoid-Hunchback element in the Drosophila embryo as 
an example, are described in Ref [25]. It is worth noting
that a parallel discussion of optimization principles
for information transmission has a long history in the 
context of neural coding, where we can think of the dis- 
tribution on inputs as given by the sensory environment 
and optimization is used to predict the form of the
input/output relation [30, 31, 32, 33]. Although there are
many open questions, it would be attractive if a single 
principle could unify our understanding of information 
flow across such a wide range of biological systems. 



APPENDIX A: FINDING OPTIMAL CHANNEL
CAPACITIES

If we treat the kernel on a discrete (c, g) grid we can
easily choose such p(c) as to maximize the mutual
information I(c; g) between the expression level and the
concentration. The problem can be stated in terms of
the following variational principle:

$$\mathcal{L}[p(c)] = \sum_{c,g} p(g|c)\, p(c)\, \log \frac{p(g|c)}{p(g)} - \Lambda \sum_c p(c), \tag{A1}$$

where the multiplier Λ enforces the normalization of p(c),
and p(g) itself is a function of the unknown distribution
(since p(g) = Σ_c p(g|c) p(c)). The solution p*(c) of this
problem achieves the capacity, I(c; g), of the channel.

The original idea behind the Blahut-Arimoto approach
[21] was to understand that the maximization of Eq (A1)
using variational objects p(c_i) is equivalent to the
following maximization:

$$\max_{p(c)} \mathcal{L}[p(c)] = \max_{p(c)} \max_{p(c|g)} \mathcal{L}'[p(c), p(c|g)], \tag{A2}$$

where

$$\mathcal{L}'[p(c), p(c|g)] = \sum_{g,c} p(c)\, p(g|c)\, \log \frac{p(c|g)}{p(c)} - \Lambda \sum_c p(c). \tag{A3}$$

In words, finding the extremum in variational object p(c)
is equivalent to a double maximization of a modified
Lagrangian, where both p(c) and p(c|g) are treated as
independent variational objects. The extremum of the
modified Lagrangian is achieved exactly when the consistency
condition

$$p(c|g) = \frac{p(g|c)\, p(c)}{\sum_c p(g|c)\, p(c)}$$

holds. This allows us to make an iterative algorithm that
we detail below, where Eq (A3) is solved for the optimal
p(c) and evaluated at some "known" p(c|g), which is in turn
updated with the newly obtained estimate of p(c).

Before describing the algorithm let us also suppose
that each input signal c carries some metabolic or time
cost to the cell. Then we can introduce a cost vector
v(c) that assigns a cost to each codeword c, and require
of the solution the following:

$$\sum_c p(c)\, v(c) \le C_0, \tag{A4}$$

where C_0 is the maximum allowed expense. The constraint
can be introduced into the functional, Eq (A1)
or Eq (A3), through an appropriate Lagrange multiplier;
the same approach can be taken to introduce the cost of
coding for the output words, Σ_{c,g} p(g|c) p(c) v(g),
because it reduces to an additional "effective" cost for the
input, v(c) = Σ_g p(g|c) v(g).

As was pointed out in the main text, after discretization
we have no guarantees that the optimal distribution
p(c_i) is going to be smooth. One way to address this
problem is to enforce the smoothness on the scale set by
the precision at which the input concentration can be
controlled by the cell, σ_c(c), by penalizing big derivatives
in the Lagrangian of Eq (A3). An alternative way
is to find the spiky solution (without imposing any direct
penalty term), but interpret it not as a real, "physical"
concentration distribution, but rather as the distribution
of concentrations that the cell attempts to generate, c*.
In this case, however, the limited resolution of the input,
σ_c(c), must be referred to the output as an additional,
effective noise in gene expression, σ_g² = (∂g/∂c)² σ_c².
The optimal solution p(c*) is therefore the distribution
of the levels that the cell would use if it had infinitely
precise control over choosing various c* (i.e. if the input
noise were absent), but the physical concentrations are
obtained by convolving this optimal result p(c*) with a
Gaussian of width σ_c(c*). Although we chose to use the
second approach to compute the results of this paper,
we will, for completeness, describe next how to include
the smoothness constraint into the functional explicitly.

If the smoothness of the input distribution p(c) is
explicitly constrained in the optimization problem, then
it will be controllable through an additional Lagrange
multiplier, and both ways of computing the capacity -
that of referring the limited input resolution σ_c(c) to
the noise in the output, and that of including it as a
smoothness constraint on the input distribution - will be
possible within a single framework. We proceed by analogy
to field theories in which the kinetic energy terms
of the form ∫ |∇f(x)|² dx constrain the gradient magnitude,
and form the following functional:

$$\mathcal{L}[p(c)] = I(c;g) - \Lambda_0 \sum_c p(c) \tag{A5}$$
$$\qquad - \Phi_1 \sum_c p(c)\, v_1(c) - \Phi_2 \sum_g p(g)\, v_2(g) \tag{A6}$$
$$\qquad - \Theta \sum_i \left( \frac{\Delta p(c_i)}{\Delta c_i} \right)^2 \sigma^2(c_i). \tag{A7}$$

Eq (A5) maximizes the capacity with respect to variational
objects p(c) while keeping the distribution normalized;
Eq (A6) imposes cost v_1(c) on input symbols
and cost v_2(g) on output symbols; finally, Eq (A7) limits
the derivative of the resulting solution. The difference
operator Δ is defined for an arbitrary function f(c):

$$\Delta f(c_i) = f(c_{i+1}) - f(c_i). \tag{A8}$$

σ(c) assigns a different weight to various intervals on
the input axis, c. If the input cannot be precisely
controlled, but has an uncertainty of σ(c) at mean input
level c, we require that the optimal probability distribution
must not change much as the input fluctuates on
the scale σ(c). In other words, we require for each input
concentration that:

$$\delta p = \frac{\Delta p}{\Delta c}\, \sigma(c) \ll 1; \tag{A9}$$

the term in Eq (A7) constrained by the Lagrange multiplier Θ
can be seen as the sum of squares of such variations over
all possible values of the input.



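By differentiating the functional, Eq (A3), that includes the relevant constraints, with respect to p(c_i) we get the
following equation:

$$0 = \sum_g p(g|c_i) \log p(c_i|g) - \log p(c_i) - \Lambda - \Phi_1 v_1(c_i) - \Phi_2 \sum_g p(g|c_i)\, v_2(g) \tag{A10}$$
$$\qquad + \Theta \left\{ \frac{2\sigma^2(c_i)}{(\Delta c_i)^2} \left[ p(c_{i+1}) - p(c_i) \right] - \frac{2\sigma^2(c_{i-1})}{(\Delta c_{i-1})^2} \left[ p(c_i) - p(c_{i-1}) \right] \right\}. \tag{A11}$$

Let us denote by F(c, p(c)) the term in braces. The solution for p(c) is therefore given by:

$$p(c) = \frac{1}{Z} \exp\left[ \sum_g p(g|c) \log p(c|g) - \Phi_1 v_1(c) - \Phi_2 \sum_g p(g|c)\, v_2(g) + \Theta F(c, p(c)) \right], \tag{A12}$$

where Z is fixed by the normalization of p(c).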

We can now continue to use the Blahut-Arimoto trick of pretending that $p(c|g)$ is an independent variational object,
and that $p(c)$ has to be solved with $p(c|g)$ held fixed; however, even in that case, Eq (A12) is an implicit equation
for $p(c)$ which needs to be solved by numerical means. The complete iterative prescription is therefore as follows:

$p^n(g) = \sum_c p(g|c)\,p^n(c)$ (A13)

$p^n(c|g) = \frac{p(g|c)\,p^n(c)}{p^n(g)}$ (A14)

$p^{n+1}(c) = \frac{1}{Z} \exp\left[ \sum_g p(g|c) \log p^n(c|g) - \Phi_1 v_1(c) - \Phi_2 \sum_g p(g|c)\,v_2(g) + \Theta F(c, p^{n+1}(c)) \right].$ (A15)



Again, Eq (A15) has to be solved on its own by numerical
means as the variational objects for iteration $(n+1)$
appear both on the left- and right-hand sides. The
input and output costs of coding are neglected if one sets
$\Phi_1 = \Phi_2 = 0$; likewise, the smoothness constraint is ignored
for $\Theta = 0$, in which case Eq (A15) is the same as in the
original Blahut-Arimoto derivation and it gives the value
of $p^{n+1}(c)$ explicitly.
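
For the explicit case $\Theta = 0$ the iteration of Eqs (A13)-(A15) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the code used for the paper: the function name `blahut_arimoto`, the uniform starting distribution, the stopping rule and the default iteration count are all choices made here. The sketch uses the identity $\sum_g p(g|c)\log p^n(c|g) = \log p^n(c) + \sum_g p(g|c)\log[p(g|c)/p^n(g)]$, so that Eq (A15) becomes a multiplicative update of $p^n(c)$.

```python
import numpy as np

def blahut_arimoto(kernel, v1=None, v2=None, phi1=0.0, phi2=0.0,
                   n_iter=2000, tol=1e-12):
    """Iterate Eqs (A13)-(A15) with Theta = 0.

    kernel[i, j] = p(g_j | c_i), each row summing to one.
    Returns the input distribution p(c) and the information I(c;g) in bits.
    """
    n_c, n_g = kernel.shape
    v1 = np.zeros(n_c) if v1 is None else np.asarray(v1, dtype=float)
    v2 = np.zeros(n_g) if v2 is None else np.asarray(v2, dtype=float)
    out_cost = kernel @ v2                  # sum_g p(g|c) v2(g)

    def kl_rows(p_c):
        # D(c) = sum_g p(g|c) log[p(g|c) / p(g)], with 0 log 0 taken as 0
        p_g = p_c @ kernel                  # Eq (A13)
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(kernel > 0, np.log(kernel / p_g), 0.0)
        return np.sum(kernel * log_ratio, axis=1)

    p_c = np.full(n_c, 1.0 / n_c)           # assumed start: uniform input
    for _ in range(n_iter):
        # Eqs (A14)-(A15) combined: p^{n+1}(c) ~ p^n(c) exp[D(c) - costs]
        new = p_c * np.exp(kl_rows(p_c) - phi1 * v1 - phi2 * out_cost)
        new /= new.sum()                    # normalization, the 1/Z factor
        converged = np.max(np.abs(new - p_c)) < tol
        p_c = new
        if converged:
            break
    info_bits = float(p_c @ kl_rows(p_c)) / np.log(2)
    return p_c, info_bits

# Binary symmetric channel with 10% error as a quick check.
bsc = np.array([[0.9, 0.1], [0.1, 0.9]])
print(blahut_arimoto(bsc))
```

With nonzero `phi1` or `phi2` the returned information is the transmission at the cost-constrained optimum, not the unconstrained capacity. The case $\Theta \neq 0$ is not covered, since $F$ then depends on $p^{n+1}(c)$ and the update is implicit.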

For the capacities computed in this paper we have 
calculated the effective output noise that includes the 
intrinsic output noise as well as the input noise that has 
been referred to the output (see Section III); we can 
therefore set $\Theta = 0$. This approach treats all sources
of noise on the same footing and allows us to directly 
compare the magnitudes of noise sources at the input and 
the output. We also note that it makes sense to compute 
and compare the optimal distribution of outputs rather 
than inputs: the input/output kernels are degenerate 
and there are various input distributions (differing either 
in the regions that give saturated or zero response, or by 
having variations on a scale below dc) that will yield 
essentially the same distribution of outputs. 



APPENDIX B: VALIDITY OF LANGEVIN 
APPROXIMATIONS 

The Langevin approximation assumes that the fluctuations
of a quantity around its mean are Gaussian and proceeds
to calculate their variance [26]. For the calculation
of exact channel capacity we must calculate the full
input/output relation, $p(g|c)$. Even if the Langevin approach
ends up giving the correct variance as a function of the
input, $\sigma_g^2(c)$, the shape of the distribution might be far
from Gaussian. We expect such a failure when the number
of mRNA molecules is very small: the distribution of expression
levels might then be multi-peaked, with peaks corresponding
to $b, 2b, 3b, \ldots$ proteins, where $b$ is the burst size
(number of proteins produced per transcript lifetime).

In the model used in Eq (18), the parameter $\alpha = (1+b)/g_0$
determines the output noise; $g_0 = b\bar{m}$, where $\bar{m}$ is the
average number of transcripts produced during the integrating
time (i.e. the longest averaging timescale in the
problem, for example the protein lifetime or cell doubling
time). If $b \gg 1$, then the output noise is effectively determined
only by the number of transcripts, $\alpha \approx 1/\bar{m}$. We
should therefore be particularly concerned with what happens
as $\bar{m}$ gets small.
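
The large-burst limit follows in one line from the two definitions just given; the short derivation below only substitutes $g_0 = b\bar{m}$ into the expression for $\alpha$ and introduces no new assumptions.

```latex
\alpha \;=\; \frac{1+b}{g_0}
       \;=\; \frac{1+b}{b\,\bar{m}}
       \;=\; \frac{1}{\bar{m}}\left(1 + \frac{1}{b}\right)
       \;\xrightarrow{\; b \gg 1 \;}\; \frac{1}{\bar{m}} .
```

The correction to the limit is of relative size $1/b$, so for large bursts the output noise is set by the transcript count alone.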

Our plan is therefore to solve for $p(g|c)$ exactly by finding
the stationary solution of the Master equation in the
case where the noise consists of the output and switching
input contributions. In this approach, we explicitly
treat the fact that the number of transcribed messages,
designated by $m$, is discrete. We start by calculating
$p_i(m|c,t)$. The state of the promoter is described by the
index $i$, which can be 0 or 1, depending on whether the
promoter is bound by the transcription factor or not,
respectively. Normalization requires that for each value of $c$:

$\sum_{i=0,1} \sum_m p_i(m|c,t) = 1.$ (B1)

The time evolution of the system is described by the
following set of equations for an activator:

$\frac{\partial p_0(m|c,t)}{\partial t} = R_e \left( p_0(m-1|c,t) - p_0(m|c,t) \right) - \frac{1}{\tau} \left( m\,p_0(m|c,t) - (m+1)\,p_0(m+1|c,t) \right) - k_- p_0(m|c,t) + k_+ c\,p_1(m|c,t),$ (B2)

$\frac{\partial p_1(m|c,t)}{\partial t} = - \frac{1}{\tau} \left( m\,p_1(m|c,t) - (m+1)\,p_1(m+1|c,t) \right) + k_- p_0(m|c,t) - k_+ c\,p_1(m|c,t),$ (B3)

where $\tau$ is the integrating time, $k_-$ is the rate for switching
into the inactive state (off-rate of the activator), $k_+$
is the second-order on-rate, and $R_e$ is the rate of mRNA
synthesis. These constants combine to give $\bar{m} = R_e\tau$
and the input switching noise strength $\gamma = (k_-\tau)^{-1}$, see
Table I. This set of equations is supplemented by appropriate
boundary conditions for $m = 0$. To find a steady
state distribution $p(m|c, t \to \infty) = p(m|c)$, we set the
left-hand side to zero and rewrite the set of equations
(with high enough cutoff value of $m_{\max}$) in matrix form:

$M(c)\,p(c) = b,$ (B4)



where $p = (p_0(0|c), p_1(0|c), p_0(1|c), p_1(1|c), \ldots)$ and $b =
(0, 0, \ldots, 0, 1)$. Matrix $M$ (of dimension $2(m_{\max}+1)+1$
rows and $2(m_{\max}+1)$ columns) contains, in its last
row, only ones, which enforces normalization. The resulting
system is a non-singular band-diagonal system
that can be easily inverted. The input/output relation
for the number of messages is given by taking
$p(m|c) = p_0(m|c) + p_1(m|c)$.
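
The stationary solve can be sketched as follows. This is an illustrative Python version under assumptions that the text does not specify: time is measured in units of $\tau$, synthesis is simply blocked at the cutoff $m_{\max}$ so that probability is conserved, and the rectangular system is solved by least squares (`numpy.linalg.lstsq`) instead of a dedicated band solver. The function name `stationary_mrna` and its parameter names are hypothetical.

```python
import numpy as np

def stationary_mrna(m_bar, gamma, kplus_c, m_max=80):
    """Stationary p_0(m|c), p_1(m|c) of Eqs (B2)-(B3), in units where tau = 1.

    m_bar   : R_e * tau, mean transcript number at full occupancy
    gamma   : 1 / (k_- * tau), switching noise strength
    kplus_c : k_+ * c * tau, binding rate at input concentration c
    """
    k_off = 1.0 / gamma
    n = 2 * (m_max + 1)
    A = np.zeros((n, n))                    # generator: dp/dt = A p

    def idx(i, m):
        return 2 * m + i                    # ordering p0(0), p1(0), p0(1), ...

    def add_rate(src, dst, rate):
        A[dst, src] += rate                 # probability flowing into dst
        A[src, src] -= rate                 # ... and out of src

    for m in range(m_max + 1):
        if m < m_max:                       # synthesis only in the bound state
            add_rate(idx(0, m), idx(0, m + 1), m_bar)
        if m > 0:                           # degradation at rate m / tau
            add_rate(idx(0, m), idx(0, m - 1), float(m))
            add_rate(idx(1, m), idx(1, m - 1), float(m))
        add_rate(idx(0, m), idx(1, m), k_off)     # unbinding, rate k_-
        add_rate(idx(1, m), idx(0, m), kplus_c)   # binding, rate k_+ c

    M = np.vstack([A, np.ones(n)])          # last row of ones: normalization
    b = np.zeros(n + 1)
    b[-1] = 1.0
    p, *_ = np.linalg.lstsq(M, b, rcond=None)     # solves Eq (B4)
    return p[0::2], p[1::2]

p0, p1 = stationary_mrna(m_bar=6.0, gamma=1.0, kplus_c=1.0, m_max=60)
p_m = p0 + p1                               # p(m|c) = p_0(m|c) + p_1(m|c)
print(p_m.sum(), np.arange(61) @ p_m)
```

A useful consistency check is that the mean transcript number equals $\bar{m}$ times the promoter occupancy $k_+ c/(k_+ c + k_-)$, which holds for Eqs (B2)-(B3) as long as the cutoff is far above the mean.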

Having found the distribution for the number of transcripts
we then convolve it with another Poisson process,
$\mathcal{P}(g|\langle g\rangle = bm)$, i.e. $p(g|c) = \sum_m \mathcal{P}(g|\langle g\rangle = bm)\,p(m|c)$.
Finally, the result is rediscretized such that the mean expression
$\bar{g}$ runs from 0 to 1.
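
The mixture over transcript number can be illustrated with a short Python sketch. The helper names `poisson_pmf` and `protein_distribution` and the cutoff `g_max` are assumptions of the sketch, and the final rediscretization onto a normalized expression axis is not included.

```python
import numpy as np
from math import lgamma

def poisson_pmf(g, mean):
    """Poisson probabilities P(g | <g> = mean) for an integer array g."""
    g = np.asarray(g)
    if mean == 0:
        return (g == 0).astype(float)       # no transcripts, no protein
    log_p = g * np.log(mean) - mean - np.array([lgamma(k + 1.0) for k in g])
    return np.exp(log_p)

def protein_distribution(p_m, b, g_max):
    """p(g|c) = sum_m P(g | <g> = b m) p(m|c), for g = 0..g_max."""
    g = np.arange(g_max + 1)
    p_g = np.zeros(g_max + 1)
    for m, weight in enumerate(p_m):
        p_g += weight * poisson_pmf(g, b * m)
    return p_g

# Example: Poisson-distributed transcripts with mean 6 and burst size b = 5.
p_m = poisson_pmf(np.arange(61), 6.0)
p_g = protein_distribution(p_m, b=5, g_max=600)
g = np.arange(601)
mean = g @ p_g
print(mean, (g - mean) ** 2 @ p_g)
```

For Poisson transcripts the variance of the result is $b\bar{m}(1+b)$, which divided by $g_0^2 = (b\bar{m})^2$ reproduces $\alpha = (1+b)/g_0$; the example is chosen so that this can be checked directly.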

Note that the Langevin approximation only depends
on the combination of the burst size $b$ and the mean number
of transcripts $\bar{m}$ through $\alpha$; in contrast, the Master
equation solution depends on both $b$ and $\bar{m}$ independently.
The generalization of this calculation to repressors
or Hill-coefficient-type cooperativity is straightforward.

Fig 8C shows that the Langevin approximation yields
correct second moments of the output distribution; however,
Gaussian distributions themselves are, for large
burst sizes and small number of messages, inconsistent
with the exact solutions, as can be seen in Fig 8A. In the
opposite limit where the number of messages is increased
and burst size kept small, see Fig 8B, normal distributions
are an excellent approximation. Despite these difficulties
the information capacity calculated with either
Gaussian or Master input/output relations differs by at
most 12% over a large range of burst sizes $b$ and values
for $\alpha$, illustrated by Fig 8D.

FIG. 8: (Color online) Exact solutions (black) for
input/output relations, $p(g|c)$, compared to their Gaussian
approximations (gray). Panel A shows the distribution of
outputs at maximal induction, $p(g|c_{\max})$, for a system with a
large burst size, $b$ = 5^ and a large output noise $\alpha$ = |
(i.e. the average number of messages is 6, as is evident
from the number of peaks, each of which corresponds to a
burst of translation at a different number of messages). Panel
B shows the same distribution for smaller output noise,
$b$ = 5^ and $\alpha$ = ; here the Gaussian approximation
performs well. Both cases are computed with switching noise
parameter $\gamma$ = and cooperativity of $h = 2$. Panel
C shows in color-code the error made in computing the
standard deviation of the output given $c$; the error measure
we use is the maximum difference between the exact
and Gaussian results over the full range of concentrations:
$\max_c \mathrm{abs}\{[\sigma_g(c)/g_0]_{\rm Master} - [\sigma_g(c)/g_0]_{\rm Gaussian}\}$. As expected,
the error decreases with decreasing output noise. Panel D
shows that the capacity is overestimated by using an
approximate kernel, but the error again decreases with decreasing
noise as Langevin becomes an increasingly good approximation
to the true distribution. In the worst case the
approximation is about 12% off. The Gaussian computation only depends
on $\alpha$ and not separately on burst size, so we plot only one
curve for $b = 1$.



APPENDIX C: FINE-TUNING OF OPTIMAL 
DISTRIBUTIONS 



To examine the sensitivity to the perturbations in the
optimal input distributions for Fig 6 we need to generate
an ensemble of perturbations. We pick an ad
hoc prescription, whereby the optimal solution is taken,
and we add to it the 5 lowest harmonic modes on the input
domain, each with a weight that is uniformly distributed
on some range. The range determines whether
the perturbation is small or not. The resulting distribution
is clipped to be positive and renormalized. This
choice was made to induce low-frequency perturbations
(high-frequency perturbations get averaged out because
the kernel is smooth). Then, for an ensemble of 100
such perturbations, $p_i(c)$, $i = 1, \ldots, 100$, and for every
system of the noise plane in Fig 3, the divergence of
the perturbed input distribution to the true solution,
$d_i = D_{JS}(p_i(c), p^*(c))$, is computed, as well as the
information transmission, $I_i = I[p(g|c), p_i(c)]$. Figure 9 plots
the $(d_i, I_i)$ scatter plots for $3 \times 3$ representative systems
with varying amounts of output ($\alpha$) and input ($\beta$) noise,
taken from Fig 3 uniformly along the horizontal and
vertical axes.
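
The perturbation prescription and the divergence measure can be sketched in Python as follows. Several details are assumptions of the sketch because the text leaves them open: the harmonic modes are taken to be cosines on the input axis rescaled to $[0, 1]$, the weights are drawn uniformly from $[-\mathrm{scale}, \mathrm{scale}]$, negative values are clipped to zero, and the Jensen-Shannon divergence is computed in bits. The names `js_divergence` and `perturb` are hypothetical.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions, in bits."""
    mix = 0.5 * (p + q)

    def kl(a, ref):
        mask = a > 0                        # 0 log 0 is taken as 0
        return np.sum(a[mask] * np.log2(a[mask] / ref[mask]))

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def perturb(p_opt, scale, n_modes=5, rng=None):
    """Add n_modes low-frequency cosine modes with uniform random weights,
    clip negative values to zero, and renormalize."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.linspace(0.0, 1.0, len(p_opt))   # input axis rescaled to [0, 1]
    weights = rng.uniform(-scale, scale, size=n_modes)
    modes = np.cos(np.pi * np.outer(np.arange(1, n_modes + 1), x))
    p = np.clip(p_opt + weights @ modes, 0.0, None)
    return p / p.sum()

# Ensemble of 100 perturbations of a (here: flat) reference distribution.
rng = np.random.default_rng(0)
p_star = np.full(50, 0.02)
d = [js_divergence(perturb(p_star, 0.005, rng=rng), p_star) for _ in range(100)]
print(min(d), max(d))
```

Pairing each $d_i$ with the information $I_i$ computed for the perturbed input and the fixed kernel gives one scatter plot of the kind described above.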

FIG. 9: Robustness of the optimal solutions to perturbations
in the input distribution. Activator systems with no
cooperativity are plotted; their parameters are taken from a
uniformly spaced, $3 \times 3$ grid of points in the noise plane of Fig
3, such that the output noise increases along the horizontal
edge of the figure and the input noise along the vertical
edge. Each subplot shows a scatter plot of 100 perturbations
from the ideal solution; the Jensen-Shannon distance from
the optimal solution, $d_i$, is plotted on the horizontal axis and
the channel capacity (normalized to maximum when there is
no perturbation), $I_i/I_{\max}$, on the vertical axis. Red lines are
best linear fits.



Figure 9 shows that as we move towards systems with
higher capacity (lower left corner), perturbations to the
optimal solution that are at the same distance from the
optimum as in the low capacity systems (upper right
corner), will cause greater relative loss (and therefore
an even greater absolute loss) in capacity. As expected,
higher capacity systems must be better tuned, but even
for the highest capacity system considered, a perturbation
of around $d_{JS} \approx 0.2$ will only cause an average 15%
loss in capacity. We also note that for systems with high
capacity the linear relationship between the divergence
$d_i$ and capacity $I_i$ provides a better fit than for
systems with small capacity.

APPENDIX D: NONSPECIFIC BINDING

One needs to make a careful distinction between the
total concentration of the input transcription factors, $c_t$,
and the free concentration $c_f$, diffusing in solution in
the nucleus. We imagine the true binding site embedded
in a pool of non-specific binding sites - perhaps all
other short fragments of DNA - with an ongoing
competition between one functional site (with strong
affinity) and a large number of weaker non-specific sites.
If these non-specific sites are present at concentration $\rho$
in the cell, and have affinities drawn from some distribution
$p(K)$, the relationship between the free and the
total concentration of the input is:

$c_t = c_f + \rho \int dK\, p(K)\, \frac{c_f}{c_f + K}.$ (D1)

Importantly, the concentration that enters all information
capacity calculations is the free concentration $c_f$,
because it directly determines both the promoter occupancy
in Eq (12) as well as diffusive noise; on the other
hand, the cell can influence the free concentration only
by producing more or less of the transcription factor, i.e.
by varying (and paying for) the total concentration. If
the free concentration is well below the strength of the
non-specific binding, $\langle K \rangle$, Eq (D1) can be approximated
by $c_t \approx c_f(1 + \rho/\langle K \rangle)$, with the total and free concentrations
being proportional to each other. Because the cost
functional, Eq (21), is only determined to within a factor
anyway, the presence of non-specific sites will effectively
just rescale the cost per free molecule of transcription
factor. A separate calculation is needed to show that
the presence of non-specific binding does not appreciably
increase the noise in gene expression (to be presented
elsewhere).
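
The proportionality between total and free concentration at low occupancy can be checked numerically. The Python sketch below assumes simple equilibrium occupancy $c_f/(c_f + K)$ for each non-specific site and replaces the integral over the affinity distribution by an average over samples of $K$; the function names and the example numbers are hypothetical. The linear form uses $\langle 1/K \rangle$, which coincides with $1/\langle K \rangle$ when the affinities are narrowly distributed.

```python
import numpy as np

def total_concentration(c_free, rho, K_samples):
    """Total concentration, with the integral over p(K) replaced by a
    sample average of the site occupancy c_f / (c_f + K)."""
    K = np.asarray(K_samples, dtype=float)
    return c_free + rho * np.mean(c_free / (c_free + K))

def linear_approximation(c_free, rho, K_samples):
    """Low-concentration limit c_t ~ c_f (1 + rho <1/K>)."""
    K = np.asarray(K_samples, dtype=float)
    return c_free * (1.0 + rho * np.mean(1.0 / K))

# Free concentration well below K (arbitrary but consistent units).
K = np.full(10, 100.0)
print(total_concentration(1.0, 1000.0, K), linear_approximation(1.0, 1000.0, K))
```

In this example the two values differ by about one percent, and the ratio $c_t/c_f$ is the factor by which the cost per free molecule is rescaled.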